Abstract

Image feature representation plays an essential role in image
recognition and related tasks. The current state-of-the-art
feature learning paradigm is supervised learning from labeled
data. However, this paradigm requires large-scale category
labels, which limits its applicability to domains where labels
are hard to obtain. In this paper, we propose a new data-driven
feature learning paradigm which does not rely on category
labels. Instead, we learn from user behavior data collected on
social media. Concretely, we use the image relationship
discovered in the latent space from the user behavior data to
guide the image feature learning. We collect a large-scale image
and user behavior dataset from Behance.net. The dataset consists
of 1.9 million images and over 300 million view records from 1.9
million users. We validate our feature learning paradigm on this
dataset and find that the learned feature significantly
outperforms the state-of-the-art image features in learning
better image similarities. We also show that the learned feature
performs competitively on various recognition benchmarks.


1. Introduction 

Image recognition is a central problem in Computer Vision that
has seen great progress over the last decade. Feature learning
plays an essential role in image recognition. Traditional
recognition methods, such as [10, 13, 4, 1, 18, 12], are based
on hand-crafted image features. These feature representations
require a significant amount of domain knowledge and do not
generalize well to new domains. The current state-of-the-art
feature learning paradigm is supervised learning from data [9].
This data-driven supervised feature learning paradigm does not
require domain knowledge but requires large datasets with
category labels to train properly. However, collecting large
labeled datasets is not an easy task, even with the help of
crowdsourcing. For instance, labeling images in a specialized
domain often requires experts, who may be hard to find via
crowdsourcing. The lack of large labeled datasets limits the
applicability of the supervised feature learning paradigm in new
problem domains.

Figure 1: Comparison of our feature and the ImageNet feature [9]
for image similarity on the Behance data. The leftmost images
are the queries and the rest are the top-5 nearest neighbor
images based on the cosine similarity of the features (no
re-ranking). The top rows are from our feature and the bottom
rows are from the ImageNet feature. One can observe that our
feature yields more similar images in terms of both content and
style.

Since unlabeled image data are plentiful, the general directions
in the literature for overcoming the dataset issue include
unsupervised feature learning [2, 11, 16, 15, 17] and transfer
learning (using labeled data from different domains) [14]. These
methods hold great promise, but regardless of how successful
they are, there remains an interesting research question: Are
category-level labels the only way for data-driven feature
learning?

There has been a surge of social media websites in the last ten
years. Most social media websites, such as Pinterest, have been
collecting the content data that users share as well as the
behavior data of the users. User behavior data are the
activities of individual users, such as likes, comments, or view
histories, and they carry rich information about the
corresponding content data. For instance, two photos of a
similar style on Pinterest tend to be pinned by the same user.
If we aggregate the user behavior data across many users, we may
recover interesting properties of the content. For instance, the
photos liked by a group of users with similar interests tend to
have very similar styles.

In this paper, we propose a new paradigm for data-driven image
feature learning which we call collaborative feature learning.
The main idea in collaborative feature learning is to learn
image features from user behavior data on social media. In
particular, we use the user behavior data collected on social
media to recover latent representations of individual images and
learn a feature transformation from the images to the recovered
latent representations. It is a major departure from existing
feature learning paradigms such as supervised learning, in that
we do not rely on category labels at all. There are several
challenges in this new feature learning paradigm. In particular,
user behavior data can be very sparse and noisy. For instance,
most users only see a very small portion of all the images, and
some user behavior data are erroneous. Fortunately, there exist
structures in the behavior data over all the users, and these
structures can be exploited to deal with sparsity and noise in
the data.

We acknowledge that our new data-driven feature learning
paradigm only applies where social media data are available, and
that its effectiveness is determined by the quality of the data.
In fact, all data-driven methods share the same limitation. To
test our feature learning paradigm, we collect a large-scale
dataset from Behance.net, a popular social media website that
focuses on artists and designers. We download about 1.9 million
Behance artworks along with the view history of about 1.9
million users, which results in more than 300 million
user-artwork view records. In the experiments, we find that the
learned latent representations indeed reflect rich visual and
semantic information of the images. We further observe that the
image features learned from the latent representations not only
perform well on standard image recognition benchmarks but also
outperform the state-of-the-art feature (the ImageNet feature
[9]) on tasks such as finding images of similar styles (see
Figure 1).


1.1. Main contributions 

We propose a new paradigm for feature learning from social
media. We completely forgo the category labels used in existing
feature learning paradigms. Instead, we use user behavior data
collected on social media. Our paradigm can take advantage of
the massive data that are collected on social media, which
mitigates the dataset scalability issue in feature learning and
image recognition in general. We further validate and test our
paradigm on large-scale data collected from a real-world social
media website, and show promising results. Finally, we remark
that although the focus of this paper is image and visual data,
our feature learning paradigm is by no means limited to learning
visual features. For instance, it can be used to learn
interesting audio features from social media websites such as
Spotify [19].

1.2. Related work 

Image features play an important role in various image
recognition problems. There is a rich body of literature in
Computer Vision on image features, and a comprehensive review is
beyond the scope of this paper. Early methods [10, 13] use
low-level features which are more about appearance, and recent
methods, such as [4, 1, 18, 12], focus on high-level features
which are more about semantics. In contrast to hand-crafted
features, features learned directly from data are the current
state of the art [9]. Data-driven features have been shown to
effectively encode both semantics and appearance and to
outperform previous methods on many recognition benchmarks, but
they need a large number of labeled images (on the order of
millions) to train properly. Unsupervised feature learning
methods, to name a few [2, 11, 16, 15, 17], hold significant
promise in terms of overcoming the labeled dataset limitation.

In terms of using social media data for learning, our work is
related to [19] and [20]. [19] addresses the cold start problem
in music recommendation. It mines latent factors from user music
listening logs and uses convolutional neural networks as a
nonlinear regressor to predict the latent factors. Different
from [19], this work focuses on how to learn image features from
user behavior data. This work is also different from [20], which
uses deep belief nets to fuse and learn unified features for
multi-modal social media data. Instead of learning features,
[20] starts from low-level features of multi-modal social media
data and focuses on learning a common feature map for various
tasks.

Our method uses singular value decomposition based collaborative
filtering, which is a well-studied area in recommender systems
[5]. In particular, we adopt the ideas in [ ] to handle implicit
feedback data and combine them with the negative sampling
strategy proposed in [3].


2. Collaborative Feature Learning 

The proposed approach is a framework that unifies latent factor
analysis and deep convolutional neural networks for image
feature learning from social media. Although rich social
information can be harvested from social websites, such as
content items, item tags, user social friendships, user views
and comments, we focus on the simple form of user-item view data
in this work to keep our feature learning framework general.
Given a set of content items $\mathcal{I} = \{I_1, \dots, I_m\}$
and a set of users $\mathcal{U} = \{U_1, \dots, U_n\}$, the
corresponding user-item view data is in the format of a matrix
between $\mathcal{I}$ and $\mathcal{U}$, which is denoted as
$V \in \{1, -1\}^{m \times n}$. We set $V_{ij} = 1$ if $U_j$
viewed content item $I_i$ (regardless of the number of views),
and $V_{ij} = -1$ otherwise (missing entries). Note that we use
$-1$ to denote the missing entries, which should not be confused
with the negative signals mentioned in the following sections.
As we will show later, the user-item view matrix encodes a lot
of information about the similarity between different content
items, which we can use for supervised image feature learning
for the social content. The content items could presumably be in
any media format, such as video, images, or audio. In this work,
we focus on images and leave feature learning for other media
formats as future work.

Figure 2 provides an overview of our approach. Based on the
user-item view matrix, we use collaborative filtering to
decompose it into the product between content item latent
factors and user latent factors. As the latent factors of
content items encode rich information about the similarity
between the content items, we then generate pseudo classes for
the content items by clustering their corresponding latent
factors using K-means. A deep convolutional neural network
(DCNN) [9] is then trained on these pseudo classes in the
traditional supervised way. Finally, the trained DCNN can be
used to extract content features for our social content domain.
In the following sections, we cover the details of our approach.

2.1. Collaborative filtering for latent factor analysis 

In this part, we explain how to extract content latent factors
from the user-item view matrix.


Implicit feedback data. The view matrix we consider here only
records the minimum information of whether a user viewed a
particular content item or not. This is called implicit feedback
in the literature, which is an indirect reflection of a user’s
true opinion of the content item. Compared with explicit
feedback data, such as user ratings of movies on Netflix and of
products on Amazon, where explicit positive or negative feedback
is given, implicit feedback data is more general and available
at a much larger scale. For example, Amazon can collect clicks
and views in far greater numbers, and far more easily, than
reviews or comments from buyers. However, implicit feedback data
is typically a much noisier and weaker indication of the user’s
true opinion. Since it does not contain explicit negative
signals, there is no easy way to identify negative signals from
the missing data: a missing entry could be a sign of dislike, or
the user may simply not have discovered the content yet. Given
the massive amount of content on social media, the user-item
view matrix is extremely sparse (e.g., well over 99% of the
entries are missing), and it is very likely that most of the
missing entries are due to the content not having been
discovered by the user.

Figure 2: Approach overview.

To address this issue, inspired by the experience from an
industrial competition on building music recommendation systems
[3], we sample “negatives” from the large number of missing
entries probabilistically. When drawing the negative samples,
the sampling follows a probability distribution that is
proportional to the popularity of the content. The rationale
behind our sampling is that popular content has a higher chance
of being discovered by the user, and therefore a missing entry
is more likely to suggest a negative attitude of the user. The
popularity of a content item is a measure of how much exposure
it receives from users, i.e., how many users have viewed it.
Formally, given the view matrix $V$, we define the popularity
$p_i$ of item $I_i$ as

$$p_i = \sum_{j=1}^{n} \mathbb{1}[V_{ij} = 1]. \qquad (1)$$

Based on the content popularity, the sampling distribution for
negative data is defined as:

$$\Pr_{ij} \propto \begin{cases} p_i, & V_{ij} = -1 \\ 0, & V_{ij} = 1 \end{cases} \qquad (2)$$

By construction, we never sample entries that are positive. In
practice, we take the logarithm of $p_i$ and normalize
$\Pr_{ij}$ with respect to each user, so that for each user the
sampling probabilities sum to 1. Algorithm 1 describes our
sampling process for obtaining negative data, where
$\mathcal{N}$ is the set of sampled negative view entries. For
every missing entry $V_{ij} \in \mathcal{N}$, we set
$V_{ij} = 0$ for further analysis.
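
The sampling step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name `sample_negatives`, the dictionary-based input format, the use of `log(1 + p_i)` as the damped popularity weight, and sampling without replacement are all choices made here for the example.

```python
import math
import random


def sample_negatives(views, num_items, ratio=2, seed=0):
    """Sample popularity-weighted negatives for every user.

    views: dict mapping user id -> set of viewed item indices (positives).
    Returns a dict mapping user id -> set of sampled negative item indices.
    """
    rng = random.Random(seed)

    # Popularity p_i: number of users who viewed item i (Equation 1).
    popularity = [0] * num_items
    for items in views.values():
        for i in items:
            popularity[i] += 1

    negatives = {}
    for user, positives in views.items():
        # Only missing entries are candidates, so positives get probability 0.
        candidates = [i for i in range(num_items) if i not in positives]
        # Assumption: log(1 + p_i) as the log-damped weight, so that items
        # nobody has viewed keep weight 0 and are never drawn.
        weights = [math.log1p(popularity[i]) for i in candidates]
        drawable = sum(1 for w in weights if w > 0)
        n_j = min(ratio * len(positives), drawable)

        # Assumption: draw without replacement by rejecting repeated picks.
        chosen = set()
        while len(chosen) < n_j:
            chosen.add(rng.choices(candidates, weights=weights, k=1)[0])
        negatives[user] = chosen
    return negatives
```

`random.choices` normalizes the weights internally, which corresponds to normalizing the distribution per user. The rejection loop is only practical when the number of requested negatives is small relative to the number of candidates, as it is for a very sparse view matrix.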


Matrix factorization. Among different methods in collaborative
filtering for latent factor analysis, matrix-factorization-based
models have gained a lot of popularity thanks to their
attractive accuracy and scalability. In this work, we focus on
applying a matrix factorization model to our user-item view
matrix. The model associates each user $U_j$ with a user latent
factor vector $x_j \in \mathbb{R}^d$ and each item $I_i$ with an
item latent factor vector $y_i \in \mathbb{R}^d$, where $d$ is
the dimension of the latent space, with $d \ll |\mathcal{I}|$
and $d \ll |\mathcal{U}|$. The prediction for an entry $V_{ij}$
is the inner product of the two latent factors, i.e.,
$\hat{V}_{ij} = y_i^T x_j$. With the positive entries and the
sampled negative entries from our user-item view matrix, we
conduct matrix factorization with regularization as

$$\min_{x, y} \sum_{i,j} (V_{ij} - y_i^T x_j)^2 + \lambda (\|x_j\|^2 + \|y_i\|^2) \qquad (3)$$

Here $\lambda$ is the weight placed on the regularization term.
Note that the summation is over the “non-missing” entries in $V$
only, including both the positives and the sampled negatives.
Due to the large size of our problem (on the order of millions),
we adopt stochastic gradient descent (SGD) to solve Equation
(3). At each iteration of SGD, a single non-missing entry in $V$
is randomly picked, and the partial gradients with respect to
the involved $x_j$ and $y_i$ are calculated in order to update
them. We further improve the optimization efficiency with
asynchronous SGD, which updates parameters in parallel for
multiple non-missing entries of $V$. Because the user-item view
matrix is extremely sparse, there is little chance of conflicting
parameter updates in asynchronous SGD. In practice, we find that
asynchronous SGD significantly speeds up the optimization
without compromising the quality of the solution, and that it
converges stably.
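
To make the per-entry update concrete, the following is a small single-threaded sketch of SGD for Equation (3). It is an illustration only: the function name `factorize`, the learning rate, the initialization range, and the epoch count are assumptions chosen for a toy problem, and the asynchronous parallel variant described above is not shown.

```python
import random


def factorize(entries, num_items, num_users, d=2, lam=0.01,
              lr=0.05, epochs=500, seed=0):
    """Regularized matrix factorization by plain SGD.

    entries: list of (i, j, v) with item index i, user index j and
             value v (1 for a positive, 0 for a sampled negative).
    Returns (x, y): user factors x[j] and item factors y[i].
    """
    rng = random.Random(seed)
    y = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(num_items)]
    x = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(num_users)]
    entries = list(entries)

    for _ in range(epochs):
        rng.shuffle(entries)  # visit non-missing entries in random order
        for i, j, v in entries:
            pred = sum(a * b for a, b in zip(y[i], x[j]))
            err = v - pred
            for k in range(d):
                yi, xj = y[i][k], x[j][k]
                # Gradient step on (v - y_i^T x_j)^2 + lam (|x_j|^2 + |y_i|^2);
                # the constant factor 2 is absorbed into the learning rate.
                y[i][k] += lr * (err * xj - lam * yi)
                x[j][k] += lr * (err * yi - lam * xj)
    return x, y
```

Each update touches only one row of the item factors and one row of the user factors, which is why parallel updates on a very sparse matrix rarely collide.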

2.2. Image feature learning with pseudo classes 

As we will see in the experiment section, the computed latent
factors of our content items encode rich high-level visual and
semantic information of the corresponding content items.
Therefore, the learned latent factors can serve as a good source
of supervision for learning meaningful features for our social
media content, i.e., image visual features in this work.
Inspired by recent work showing that a deep convolutional neural
network (DCNN) trained on large-scale ImageNet [9] learns quite
generic visual features, we use the latent factors to create
pseudo classes for our content items and then apply a DCNN to
learn high-level features in a supervised way.

Algorithm 1 Negative Sampling

Input: initial view matrix $V \in \{1, -1\}^{m \times n}$,
       sampling probability $\Pr_{ij}$, number of negatives to
       sample $n_j$ for each user
Output: updated view matrix with negative sample set $\mathcal{N}$.
1: $\mathcal{N} \leftarrow \emptyset$
2: for $j = 1, \dots, n$ do
3:     $\mathcal{N}_j \leftarrow$ sample $n_j$ negatives according to $\Pr_{ij}$
4:     $\forall i \in \mathcal{N}_j$, $V_{ij} = 0$
5: end for


Learning with pseudo classes. Even though they can be noisy, the
latent factors largely reflect the semantic relationship between
our content items. Although this source of information can be
used in different forms to learn visual features, such as
learning from sampled triplets, we resort to a more traditional
supervised approach by creating pseudo classes from the content
items. Specifically, we first cluster the latent factor space
into $K$ clusters $\{a_1, \dots, a_K\}$ with k-means. Then we
create the pseudo classes $\{c_1, \dots, c_K\}$ by partitioning
the content items based on the cluster index of their latent
factors, i.e.,

$$c_k = \{ I_i : y_i \in a_k \}. \qquad (4)$$

Finally, we use the same deep neural network structure proposed
in [9], which contains five convolutional layers and two fully
connected layers, to learn a K-way DCNN classification model,
from which we can extract high-level visual features for our
social content.
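
The construction of pseudo classes in Equation (4) can be illustrated with a small, dependency-free k-means (Lloyd's algorithm). This is a sketch for toy inputs, assuming random initialization from the data points and a fixed iteration count; the names `kmeans` and `pseudo_classes` are hypothetical, and a large-scale run would use an optimized clustering library instead.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for n, p in enumerate(points):
            labels[n] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def pseudo_classes(item_factors, k, seed=0):
    """Group item indices by the cluster of their latent factor (Equation 4)."""
    labels = kmeans(item_factors, k, seed=seed)
    classes = {c: [] for c in range(k)}
    for i, lab in enumerate(labels):
        classes[lab].append(i)
    return classes
```

The resulting cluster index of each item then plays the role of its class label when training the K-way classifier.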

Since the latent factor space is continuous, the above procedure
may suffer from quantization error. Alternatively, we could
learn a DCNN regression function directly from a content item to
its latent factor [19], which defines a continuous mapping
between the images and the latent factors. However, in practice,
we find that training the network in this way results in
inferior features. First, the latent space is a high-dimensional
continuous space, and learning a regression function directly
from image pixels to this continuous space is very slow. Second,
the latent factors obtained from the user behavior data are
noisy and encode both visual semantic information and some
amount of purely social and cultural information. Directly
enforcing the mapping between image pixels and noisy latent
factors could derail the learning process, especially when using
an $\ell_2$ norm cost function, which is not robust to outliers.
By creating discrete pseudo classes, which is essentially a
lossy vector quantization coding transform, we can use the
softmax loss function, which is more robust to outliers. On the
other hand, as the latent factors are themselves noisy, the
quantization loss may not matter much. In our experiments, we
find that learning from discrete pseudo classes produces
satisfactory visual features.
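
For reference, the two training objectives contrasted above can be written in their standard textbook forms. The notation is an assumption introduced here rather than taken from the paper: $f_\theta$ denotes the network, $z_{i,k}$ its $k$-th output score for image $I_i$, and $k_i$ the index of the pseudo class containing $I_i$.

```latex
% Direct regression onto latent factors (squared \ell_2 loss):
L_{\mathrm{reg}}(\theta) = \sum_{i} \left\| f_\theta(I_i) - y_i \right\|_2^2

% K-way classification over pseudo classes (softmax loss):
L_{\mathrm{cls}}(\theta) = - \sum_{i} \log
    \frac{\exp(z_{i,k_i})}{\sum_{k=1}^{K} \exp(z_{i,k})}
```

In the regression loss, an item whose latent factor is far off contributes quadratically in that error, whereas in the classification loss it only contributes through a single, possibly wrong, class index.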

3. Behance 2M Dataset 

To validate our new feature learning paradigm, we collect a
large-scale image and user behavior dataset from Behance.net.
Behance is a popular social media website for professional
photographers, artists, and designers to share their work.
Content data on Behance are mostly in the form of images, with a
small portion of videos as well. The content is very diverse,
ranging from photographs to cartoons, paintings, typography,
graphic design, etc. Content data on Behance are organized as
projects, each of which has associated images and videos. The
website organizes all its projects into 67 fields. Each project
may be associated with multiple fields. (Note that fields are a
very coarse categorization and overlap heavily with each other.
Therefore, they are not suitable as labels for image
classification training.) In Figure 3, we show several images
from some representative fields.

The content data on Behance are all shared by the users. 
The project owner, who uploads the project, picks one of 
the most representative images as the cover image, which 
will be presented to other users. While browsing over a 
large number of cover images, a user simply clicks the cover 
image of interest and will be directed to the project page 
with the entire content. Behance records the view data for 
each project which is a list of users who have viewed the 
project. Choosing Behance as our testbed is motivated by 
the observation that a user tends to view projects of similar 
contents or styles, which is the key to our approach. 

We first download all the cover images for 1.9 million projects,
and for each project we obtain the list of users who have viewed
it, which results in 326 million view records from 1.9 million
users. Note that we only download the cover images and use them
in our experiments, although it is possible to obtain the
additional project content. The density of the view matrix
(without negative sampling) is about 0.0093%.

We further process the raw data by removing the most popular and
the least popular projects, because projects with too many or
too few views cannot be modeled properly by latent factors.
Similarly, we remove the users that are too active or too
inactive. The minimum and maximum thresholds we use for both
project view counts and user view counts are (10, 20000). The
thresholds are chosen to retain most of the data. In particular,
after thresholding, we have 1.9 million projects and 310 million
view records from 0.93 million users. The density of the view
matrix goes up to 0.0176%. Although the most and least popular
projects and the most and least active users have been removed,
the distribution of views over projects, as well as the activity
of users (the number of views a user gives), is still uneven and
long-tailed.
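
The density figures follow from dividing the number of view records by the number of cells in the project-by-user matrix. The short check below uses the rounded counts quoted in the text, so it reproduces the reported percentages only approximately; the helper name `density_percent` is introduced here for illustration.

```python
def density_percent(num_records, num_projects, num_users):
    """Share of non-missing cells in the view matrix, in percent."""
    return 100.0 * num_records / (num_projects * num_users)


# Raw data: 326M view records, 1.9M projects, 1.9M users.
raw = density_percent(326e6, 1.9e6, 1.9e6)

# After thresholding: 310M view records, 1.9M projects, 0.93M users.
filtered = density_percent(310e6, 1.9e6, 0.93e6)

print(f"raw: {raw:.4f}%  filtered: {filtered:.4f}%")
```

Removing roughly half of the users while keeping most of the records is what raises the density after thresholding.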

4. Experiments 

In this section, we first study the characteristics of our
learned latent factors and analyze the information they capture.
Then we investigate the neural network based visual feature, and
evaluate its performance on the Behance dataset for image
similarity, and on standard benchmarks for style classification
and image category classification.

4.1. Experiment details 

Data split. We split the processed data carefully for fair
experiments. First, we split the set of all images into 95% for
training and 5% for testing. For clarity, we denote them as tr
and te for the rest of the paper. This first split is for
feature learning: tr is used for DCNN training and te is used
for evaluating the learned feature representation. While
training the DCNN, we hold out a small portion of tr for
validation. Accordingly, the user behavior matrix after negative
sampling, $V$, is split based on tr and te: we form the matrix
$V_{tr}$ by slicing $V$ with the images belonging to tr, and
similarly we have $V_{te}$. Next, we split $V_{tr}$ for latent
factor analysis. Note that, because the goal of latent factor
analysis is to recover the missing entries in $V$, the split is
on the non-missing entries in $V_{tr}$. We randomly sample 80%
of the non-missing entries for training and use the remaining
20% for validation.
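
The two-level split can be sketched as follows. This is an illustrative outline under assumptions made here: items are identified by integer ids, view entries are `(item, user, value)` tuples, and the function names `split_items` and `split_entries` are hypothetical.

```python
import random


def split_items(item_ids, test_fraction=0.05, seed=0):
    """Split images into a training set (tr) and a test set (te)."""
    rng = random.Random(seed)
    ids = list(item_ids)
    rng.shuffle(ids)
    n_test = int(round(len(ids) * test_fraction))
    return set(ids[n_test:]), set(ids[:n_test])


def split_entries(entries, train_items, train_fraction=0.8, seed=0):
    """Keep the non-missing entries of items in tr, then split them
    into a fitting part and a validation part."""
    rng = random.Random(seed)
    kept = [e for e in entries if e[0] in train_items]
    rng.shuffle(kept)
    n_fit = int(round(len(kept) * train_fraction))
    return kept[:n_fit], kept[n_fit:]
```

The first function splits by image, so that no test image contributes to feature learning; the second splits by matrix entry, which is what latent factor validation requires.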

Figure 3: A subset of images from our Behance dataset. Images
are organized according to the associated project fields, which
are a very coarse categorization. Eight representative fields
are shown. One may observe that the content on Behance is of
high quality (professional photos, artwork, and designs) and
very diverse (covering many different domains).

Latent factor learning. To infer the latent factors of projects
in tr, we first sample negative entries for $V$ following
Algorithm 1. For each user $j$, we set the number of negative
samples, $n_j$, to be twice the number of her/his positives.
Therefore, $V$ has about 900 million non-missing entries, and
the matrix density becomes 0.0527%. Note that as $n_j$ gets
larger, the computational cost of solving Equation (3) increases
linearly, so for computational efficiency we avoid sampling too
many negatives. We then apply regularized matrix factorization
on the training split of $V_{tr}$, and use the validation split
to estimate the optimal value of $\lambda$. The validation
criteria are the root-mean-square error (RMSE) of predicting the
validation split of $V_{tr}$ with the inferred latent factors,
and personalized ranking (PR), which measures the rank of
positive and negative items for each user. Note that both RMSE
and PR are used in an industrial machine learning competition
[3]. The entire $V_{tr}$ is then used to compute the final
latent factors with the optimal $\lambda$. We also experimented
with the dimensionality $d$ of $x$ and $y$. Generally, a very
high-dimensional latent space (large $d$) may cause overfitting
and increase the computational cost, while a very
low-dimensional one may fail to capture the latent structure.
Our validation process shows that setting $\lambda$ to 0.01 and
$d$ to 100 is a reasonable balance which achieves good RMSE and
PR while keeping the computation efficient. With this setting,
we obtain a validation RMSE of 0.2955 and a PR of 0.1891. Note
that small values for both measurements mean good performance,
and the expectation of PR at random guess is 0.5. Please refer
to [3] for more details on RMSE and PR.
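
As a rough illustration of the two validation criteria, the sketch below computes RMSE on held-out entries and a simple pairwise ranking error. The exact definition of PR is given in [3] and is not reproduced here; `pairwise_rank_error` is only a stand-in, assumed for this example, that shares the two properties stated above (lower is better, and random scores give 0.5 in expectation).

```python
import math


def predict(x_j, y_i):
    """Predicted view value: inner product of user and item factors."""
    return sum(a * b for a, b in zip(x_j, y_i))


def rmse(entries, x, y):
    """Root-mean-square error over held-out (item, user, value) entries."""
    squared = [(v - predict(x[j], y[i])) ** 2 for i, j, v in entries]
    return math.sqrt(sum(squared) / len(squared))


def pairwise_rank_error(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs of one user that are ordered
    incorrectly; ties count as half an error. Illustrative stand-in only."""
    wrong = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if q > p:
                wrong += 1.0
            elif q == p:
                wrong += 0.5
    return wrong / (len(pos_scores) * len(neg_scores))
```

In a validation loop, these quantities would be evaluated for each candidate pair of regularization weight and latent dimension, and the setting with the best trade-off retained.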

Image feature learning. Following the previous steps, we run
k-means clustering on the learned latent factors. We vary the
number of clusters over {200, 500, 1000, 2000, 3000, 5000} and
train a DCNN for each of the corresponding cluster assignments.
Similar to the data preparation pipeline in Krizhevsky et al.
[9], we resize the training images so that the short side has
256 pixels, then take the center crop of the resized images. At
training time, we use random 224 x 224 crops to augment the
dataset for improved robustness. Dropout is adopted to avoid
overfitting. Training approximately converges in 60 epochs, and
we extract features from the fully connected layers of the
DCNNs.
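
The geometry of this preprocessing can be sketched without any image library by computing sizes and crop boxes only. The sketch assumes that the center crop is 256 x 256, as in the pipeline of [9]; the text itself does not state the crop size, and the function names are hypothetical.

```python
import random


def resize_short_side(width, height, target=256):
    """New (width, height) after scaling the short side to `target`."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height, size=256):
    """(left, top, right, bottom) of the centered size x size crop.
    The 256 x 256 crop size is an assumption, see the note above."""
    left = (width - size) // 2
    top = (height - size) // 2
    return left, top, left + size, top + size


def random_crop_box(size=256, crop=224, rng=random):
    """Random crop x crop box inside a size x size image (augmentation)."""
    left = rng.randint(0, size - crop)
    top = rng.randint(0, size - crop)
    return left, top, left + crop, top + crop
```

For a 640 x 480 input, the short side (480) is scaled to 256, the centered square is cut out, and a different 224 x 224 window of that square is drawn each time the image is used for training.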

4.2. Analysis of latent factors 

As discussed earlier, latent factors from view data should
reveal some properties of the content data. Since the latent
factors also serve as an implicit supervision in our feature
learning, it is natural to ask what information they have
captured about the individual images and whether our assumption
about the correlation structure is valid. To this end, we
present a simple experiment to empirically study the latent
factors. For each project, represented by its cover image, we
retrieve its nearest neighbors in the training set tr, using
cosine similarity between latent factors. We find strong visual
and semantic proximity between query projects and their nearest
neighbors (NNs), and the observation is consistent across the
entire set. In Figure 4, we show several randomly selected
queries of various semantics and their top NNs. For clear
illustration, we also list representative tags of the queries.
Note that the tags are obtained from Behance and are provided by
project owners at upload time; therefore, tags are typically
random and noisy. (We only show the informative ones here.)
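
Nearest-neighbor retrieval by cosine similarity can be sketched as below. This brute-force version is for illustration on small inputs only; the function names are hypothetical, and a retrieval over millions of items would normally normalize the vectors once and use a matrix product or an approximate index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, database, k=5):
    """Indices of the k database vectors most similar to the query."""
    ranked = sorted(
        range(len(database)),
        key=lambda n: cosine(query, database[n]),
        reverse=True,
    )
    return ranked[:k]
```

The same routine applies whether the vectors are latent factors, as in this section, or image features extracted from a network.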

Figure 4: Query images and their retrieved nearest neighbors
(NNs) in latent factor space. Queries and their NNs share
similar visual semantics and contexts. Both coarse and fine
level semantics are well captured (compare Rows 1-3 and Rows
4-5). For easy illustration, representative tags of the query
images are listed as well.

From Figure 4, we can first observe a clear categorical
correlation at a coarse level between a query and its NNs across
all examples. For instance, rows 1-3 are all portraits of women,
rows 4-5 are about automotive design, and the following rows
include various subjects and contexts, such as houses, footwear,
and food photos. Furthermore, we notice that the latent factors
also reveal richer visual context at a finer level than
category. For instance, despite all being female portraits,
queries 1-3 have drastically different image styles and
contexts, and these differences are well respected in the
retrieved NNs. A similar phenomenon is shown in rows 4 and 5:
the car in query 4 is a classic car, and thus there are more
classic cars in its NNs, while the cars in row 5 are of more
modern design. On the other hand, we also observe a small
portion of failed queries, whose NNs are irrelevant. These
failure cases are mainly caused by two factors: (1) the sparsity
and noise in the user behavior data, and (2) social factors that
are not visually related, e.g., a user who always views his/her
friends’ projects.

4.3. Image similarity on Behance 

Given the latent factors analyzed previously, a natural question
is how well our learned image feature captures the concept
embedded in the latent space. To validate this, we conduct a
retrieval experiment on the learned image features. We use the
images in te to query against tr. Similarity between images is
calculated using cosine similarity. To validate the
effectiveness of our learned feature, we compare it with the
state-of-the-art image classification feature learned on the
ImageNet ILSVRC2012 dataset, with the same network structure and
training procedure as [9]; we refer to it as the ImageNet
feature.

We first visually inspect the relationship between a large
number of queries and their NNs. In Figure 1, we show a set of
randomly selected queries and the NNs for our feature and the
ImageNet feature [9]; our feature shows clearly better visual
consistency between a query and its NNs.

Figure 5: Quantitative comparison of our feature and the
ImageNet feature for the retrieval task on Behance images. Our
feature performs better in both measurements. The black line is
the value of the measurements obtained by randomly sampling
image pairs (random retrieval and random NNs).

Furthermore, we quantitatively evaluate the NNs with two
measurements, which are designed for the Behance data. The
measurements go beyond visual similarity and, most importantly,
reflect the relationships between images on Behance. The
measurements are: given a query image,

(1) the number of common viewers between the query and the
retrieved NNs. We measure, for each pair of query and NN, the
ratio between the number of their common viewers and the size of
the union of their viewers. As shown in Figure 5, the ratio
drops as the rank of the neighbor increases. We calculate this
measure for the top 100 NNs of each query in te, and report its
mean across te at every rank position;

(2) the number of retrieved NNs that have been viewed 
by the owner of the query image. Similarly, this quantity is 
measured between a query and all of the top 100 NNs. We 
report the average of this number across all queries in te. 
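
Both measurements reduce to simple set operations on viewer lists. The sketch below is an illustration under assumptions made here: viewers are stored as Python sets keyed by image id, and the function names are hypothetical.

```python
def viewer_overlap(viewers_a, viewers_b):
    """Measurement (1) for one pair: common viewers over union of viewers."""
    union = viewers_a | viewers_b
    if not union:
        return 0.0
    return len(viewers_a & viewers_b) / len(union)


def mean_overlap_by_rank(queries, viewers, top_k=100):
    """Mean of measurement (1) at every rank position.

    queries: list of (query_id, ranked_neighbor_ids).
    viewers: dict mapping image id -> set of user ids who viewed it.
    """
    sums = [0.0] * top_k
    counts = [0] * top_k
    for query_id, neighbors in queries:
        for rank, neighbor_id in enumerate(neighbors[:top_k]):
            sums[rank] += viewer_overlap(viewers[query_id], viewers[neighbor_id])
            counts[rank] += 1
    return [s / c for s, c in zip(sums, counts) if c]


def neighbors_viewed_by_owner(owner_views, neighbors, top_k=100):
    """Measurement (2): how many of the top-k NNs the query's owner viewed."""
    return sum(1 for n in neighbors[:top_k] if n in owner_views)
```

The first ratio is the Jaccard index of the two viewer sets, which keeps the measure comparable between popular and rarely viewed projects.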

Plots of the two measurements are shown in Figure 5, where our
feature consistently outperforms the ImageNet feature. Because
the DCNN features from the fully connected (FC) layers FC6 and
FC7 differ only marginally, we report the results of the FC6
features. (For better comparison, we also randomly sample a
large number of image pairs and calculate the random expectation
of the two measurements.)

4.4. Image classification on Benchmarks 

We apply our feature as an image descriptor to image
classification on standard benchmarks for object class
classification and visual style classification. We choose
Caltech256 [6] for object class classification. We focus more on
image style classification, due to the nature of our dataset,
which is crawled from an artistic asset sharing site. We use
three major image style benchmarks: Flickr style, which has 80k
images in 20 visual styles; Wikipaintings, which contains 85k
images belonging to 25 styles; and AVA style, which consists of
14k images with 14 photographic styles. Flickr style and
Wikipaintings are both from a recent work [8], in which Karayev
et al. evaluate various established image features on the three
benchmarks.

We select the ImageNet feature [9] and Meta-Class [1] as
competing features for their competitive performance. For the
deep learning features, we have experimented with features from
both the FC6 and FC7 layers. We found that for object
classification the performance difference is marginal, whereas
for image style classification, due to dataset bias, features
from the FC7 layer obtain lower accuracy. Therefore, we report
results on FC6 features. We also experimented with our features
trained on different numbers of pseudo classes, and found that
2000 consistently provides better performance than the others.
Each of the style benchmarks is split into 80% for training and
20% for testing, and for Caltech256 we use 50 images per
category for training and 20 images for testing. We use a linear
SVM as the classification model.

Table 1: Classification accuracy on benchmarks (%)

                 Our feature   ImageNet feature   Meta-Class
Flickr style        37.2            37.1             32.8
Wikipaintings       41.4            40.7             38.6
AVA style           56.0            51.3             53.9
Caltech256          57.6            68.9             48.9

As shown in Table 1, on the image style classification
benchmarks, our feature achieves similar or even better accuracy
than the ImageNet feature, which is the state-of-the-art single
feature for style recognition reported in [8]. For object
classification on Caltech256, our feature reaches 57.6%, which
is below the ImageNet feature (68.9%) but above Meta-Class
(48.9%). It is worth pointing out that the ImageNet feature is
learned on the ILSVRC2012 dataset with 1000 categorical labels,
and Meta-Class is trained on a subset of 8000 synsets of the
entire ImageNet database, whereas our feature is learned on
images with noisy user view data.

5. Discussion 

We propose a novel data-driven feature learning paradigm that
exploits user behavior data collected from social media
websites. Our feature learning paradigm is different from
existing methods in that it does not rely on category labels. We
show that the learned feature outperforms the state-of-the-art
features on our Behance 2M Dataset in terms of learning better
image similarities while performing competitively on various
standard recognition benchmarks. We believe using social data
for feature learning and image recognition in general is a
promising new research direction.

In terms of future work, we plan to pursue two directions.
First, we want to make our learning paradigm more robust to data
noise and non-visual factors. Second, we would like to go beyond
the view records and incorporate additional social data, such as
user relationships on social media. For more information, please
visit the project webpage.